Abstract

Background: Autism spectrum disorder (ASD) is well recognized to be genetically heterogeneous. It is assumed 
that the genetic risk factors give rise to a broad spectrum of indistinguishable behavioral presentations. 

Methods: We tested this assumption by analyzing the Autism Diagnostic Interview-Revised (ADI-R) symptom 
profiles in samples comprising six genetic disorders that carry an increased risk for ASD (22q11.2 deletion, Down's
syndrome, Prader-Willi, supernumerary marker chromosome 15, tuberous sclerosis complex and Klinefelter 
syndrome; total n = 322 cases, groups ranging in sample sizes from 21 to 90 cases). We mined the data to test the 
existence and specificity of ADI-R profiles using a multiclass extension of support vector machine (SVM) learning. 
We subsequently applied the SVM genetic disorder algorithm on idiopathic ASD profiles from the Autism Genetics 
Resource Exchange (AGRE). 

Results: Genetic disorders were associated with behavioral specificity, indicated by the accuracy and certainty of 
SVM predictions; one-by-one genetic disorder stratifications were highly accurate leading to 63% accuracy of correct 
genotype prediction when all six genetic disorder groups were analyzed simultaneously. Application of the SVM 
algorithm to AGRE cases indicated that the algorithm could detect similarity of genetic behavioral signatures in 
idiopathic ASD subjects. Also, affected sib pairs in the AGRE were behaviorally more similar when they had been 
allocated to the same genetic disorder group. 

Conclusions: Our findings provide evidence for genotype-phenotype correlations in relation to autistic 
symptomatology. SVM algorithms may be used to stratify idiopathic cases of ASD according to behavioral 
signature patterns associated with genetic disorders. Together, the results suggest a new approach for 
disentangling the heterogeneity of ASD. 



Background 

Autism spectrum disorder (ASD) is a behaviorally de- 
fined syndrome characterized by variable abnormalities 
in social interactions and communication, in association 
with restricted interest patterns and unusual stereotyped 
behaviors. There has been a concerted effort over the 
last 20 years to identify causal genetic risk factors and as 
a result, an increasing number of rare, highly penetrant 
genetic variants are being implicated [1]. When present, 
these rare variants are thought to account for a large 
proportion of an individual's genetic liability to the
condition. Currently, specific genetic etiologies, including
rare single nucleotide and copy number variants (CNVs) 
as well as larger chromosomal variations, can be identi- 
fied in around 15 to 20% of patients [2-5]. These find- 
ings highlight the complexity of the genetic architecture 
and heterogeneity of ASD and indicate that by using 
standard case-control designs, extremely large sample
sizes will be required to unravel the heterogeneity and 
map the dysregulated signaling pathways involved in the 
pathophysiology of ASD [4,6-9]. 

The variability in phenotypic expression of autism ob- 
served in monozygotic twin pairs, coupled with the evi- 
dence from molecular genetic studies supporting a 
polygenic multi-factorial liability model has led to the 
recognition that the many genetic risk factors for autism 
give rise to a broad spectrum of behavioral presentations
and hence the concept of autism as a spectrum disorder. 
The adoption of this model has led to an implicit as- 
sumption that specific genotype-phenotype correlations 
are unlikely to exist. However, there is evidence that 
ASD symptoms may be dissociable at the genetic level. 
Different genetic linkage regions have been obtained for 
social interaction and repetitive behavioral domains in 
ASD patients [10], and distinct developmental trajector- 
ies of social and repetitive behavior exist in the ASD 
population [11]. Moreover, in recent years, a growing 
interest has developed in the possibility that particular 
genetic disorders may give rise to characteristic patterns 
of autistic symptomatology. This interest is based on the 
assumption that perturbations in associated pathophysio- 
logical pathways would lead to relatively constrained and 
more specific phenotypic outcomes [12]. Indeed, a number
of recent studies, involving a variety of genetic conditions
including 16p11.2 and 7q11.23 CNVs, Williams
syndrome, fragile X syndrome and neurofibromatosis, 
have indicated the existence of genetic disorder-specific 
behavioral profiles that encourage further efforts in this 
direction [4,13-16]. Building on these findings, we postu- 
lated that well-defined genetic conditions could give rise 
to relatively distinct patterns of autistic symptomatology. 
Identifying these patterns may help to dissect ASD
heterogeneity, because other risk factors that perturb
converging pathophysiological pathways (for example, pathways
related to these genetic conditions) might lead to similar
patterns of autistic symptomatology.

In the present study, we undertook a proof-of-concept
analysis to determine whether these genotype-phenotype
correlations exist and whether they could be useful to 
disentangle the heterogeneity of ASD and complement 
future genetic studies. Support vector machine (SVM) 
learning was used to analyze 'signatures' of autistic
symptomatology in six genetic developmental disorders 
associated with an increased risk for ASD [17-20]. Based 
on the premise that other risk factors which dysregulate 
the same pathways may give rise to similar 'signature'
patterns of behavior, we aimed to apply the SVM algo- 
rithms derived from genetic disorders to cases of idio- 
pathic ASD. Finally, we investigated whether the SVM 
algorithm would detect enhanced behavioral similarity 
in affected sib pairs from the Autism Genetics Resource 
Exchange (AGRE) multiplex families. Figure 1 provides 
an overview of the different steps involved in the study. 

Methods 

Subjects 

The six genetic disorders we included in the study were: 
22q11.2 deletion syndrome (22q11DS), Down's syndrome
(DS) [21], Prader-Willi syndrome (PWS), supernumerary
marker chromosome 15 (SMC15), tuberous sclerosis
complex (TSC) and Klinefelter syndrome (XXY); total n =
322 cases, groups ranging in sample size from 21 to 90 
cases. Cases were recruited through patient associations/ 
charities or centers for clinical genetics or pediatrics as 
part of a collaborative effort between the Department of 
Psychiatry of the University Medical Centre in Utrecht in 
the Netherlands and the Institute of Psychiatry, King's
College London in the UK. Appropriate local ethical board 
approval was obtained (Medical Research Ethics Commit- 
tee, METC, of the University Medical Centre in Utrecht 
and the College Research Ethics Committee, CREC, in 
London). Informed consent for each participant in the co- 
horts was obtained and included the use of data for the 
analysis we carried out for this paper. The genetic disor- 
ders had been diagnosed through clinical genetic centers 
and confirmed by routine molecular and cytogenetic ana- 
lysis. The total sample consisted of 322 verbal subjects. 
Each of the six genetic disorders has previously been 
shown to be associated with an increased risk of ASD 
[6,7,22-25]. The cases were drawn from studies that had 
originally been designed to elucidate the behavioral phe- 
notypes associated with each of the six genetic disorders 
[22-27]. As far as possible, the samples were ascertained 
without reference to the presence of ASD. For more de- 
tails on recruitment procedures and inclusion criteria for 
the genetic disorder subtypes please see previous publica- 
tions [22-26]. All subjects were included in the analyses, 
regardless of the presence of an ASD diagnosis, in order 
to evaluate the widest range of symptom profiles. How- 
ever, for technical reasons concerning the measurement of 
ASD symptomatology, only verbal individuals were in- 
cluded in the analyses. Estimates of intellectual abilities 
were available for the majority of subjects (>80%) and had 
been assessed by different standardized measures accord- 
ing to age and ability level [28-32]. Table 1 shows the sam- 
ple characteristics. 

The AGRE database was used for the selection of idio- 
pathic subjects (http://www.agre.org) [33,34]. AGRE 
cases were included in the analyses if they fulfilled Aut- 
ism Diagnostic Interview-Revised (ADI-R) criteria for an 
ASD and complete ADI-R algorithm data were available 
(see criteria). All verbal simplex probands in the AGRE 
cohort with complete ADI-R algorithm data and scoring 
above the ASD threshold (n = 375) were assigned the 
label 'AGRE0'. Among the multiplex families we identified
all verbal affected sib pairs. Within these affected
pairs one sib was allocated to 'AGRE1' while the other
was allocated to 'AGRE2'. Therefore, AGRE1 and AGRE2
consisted of those verbal subjects with ASD with at least 
one related verbal sibling with ASD (both n = 433). 

Figure 1 Overview of the different steps undertaken in the study. Step 1: development of SVM classifier to assess the presence and strength
of behavioral signatures among genetic syndromes. Step 2: application of the classifier derived in step 1 to AGRE samples to test if similarity in
behavioral signatures can be detected among idiopathic ASD subjects. Step 3: application of classifier derived in step 1 to sibling pairs with
idiopathic ASD (AGRE) to test relative familiality of behavioral signatures derived from genetic syndromes. AGRE, Autism Genetics Resource
Exchange; ASD, autism spectrum disorder; SVM, support vector machine.

Measures

Autism symptom variables were extracted from the
ADI-R, which was used to interview the parents of each
subject [35]. The ADI-R is an established interview
schedule for assessing autism diagnoses but may also be 
used to assess profiles of autistic symptomatology 
[36,37], and as phenotype variables in large genetic 
population studies of ASD [38-41]. The interview fo- 
cuses on identifying key symptoms that characterize the 
syndrome [12,36,37]. A subset of 37 items from the 
ADI-R is used to create a diagnostic algorithm, which 
documents behaviors reported between the 4th and 5th 
birthday, regarded as the optimal window to detect 
ASD. As a consequence, the use of the diagnostic 
algorithm data minimizes the possible confound of
age-related developmental effects on symptomatology. 
ADI-R items are scored as: 0, no ASD behavioral symp- 
tom present; 1, specified behavior definitely present but 
not clearly enough to warrant a code of 2; or 2, specified 
ASD symptom definitely present. In addition, for some
items a code of 3 is given if the behavior markedly
impacts on or disrupts family life. When computing
the algorithm scores, a code of 3 is recoded as a 2.
For this study, we used these algorithm scores, with a 
range of 0 to 2 instead of 0 to 3, to assign equal weight 
to all items entered in the analyses. Because certain 
symptoms of the communication impairments charac- 
terizing ASD can only be observed in verbal individuals, 
there are separate scores for verbal and non-verbal indi- 
viduals. An overview of the description of the ADI-R 
items and the ADI-R domains of the algorithm is pro- 
vided in Table 2. The classification of an ASD in this 



Table 1 Characteristics of the total genetic disorder sample

Genetic     N                        Age (months)      ASD
disorder    Total   Female   Male                      Yes    No

22q11DS      90      42       48     162.5 ± 33.6       40     50
Down's       21      16        5     169.1 ± 32.6        6     15
PWS          88      48       40     191.9 ± 141.0      20     68
SMC15        22       8       14     161.4 ± 103.6      19      3
TSC          50      31       19     126.2 ± 74.0       22     28
XXY          51       0       51     145.4 ± 41.4       16     35
Total       322     145      177                       123    199


study was based on ADI-R criteria used in genetic stud- 
ies and the AGRE collection: ASD is diagnosed when 
scores in all domains are met or when scores are met in 
two core symptom domains, in addition to the 'age of
onset' domain, but are one point away from meeting
autism criteria in the one remaining core symptom do- 
main [35,42]. Reliability of the ADI-R in a population 
with mild to moderate mental retardation has been 
established [43]. 
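
The recoding of item codes described above can be sketched as follows. This is a minimal illustration, assuming raw codes are restricted to the values 0 to 3 mentioned in the text; the function names and the example codes are hypothetical and not part of the ADI-R or of the original analysis.

```python
def recode_adi_r_item(code: int) -> int:
    """Map a raw item code (0-3) to its algorithm score (0-2)."""
    if code not in (0, 1, 2, 3):
        # Assumption: only the codes described in the text are handled here.
        raise ValueError(f"unexpected item code: {code}")
    return min(code, 2)  # a code of 3 is recoded as 2


def algorithm_scores(raw_codes):
    """Recode one subject's raw item codes so that every item spans 0 to 2."""
    return [recode_adi_r_item(c) for c in raw_codes]


raw = [0, 1, 3, 2, 3]          # made-up codes for five items
print(algorithm_scores(raw))   # [0, 1, 2, 2, 2]
```

Capping the range in this way gives every item the same maximum contribution, which is the stated reason for using algorithm scores rather than raw codes.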

Statistical analysis 

Standard principal component analysis (PCA) of ADI-R 
item scores was used to investigate the extent of overlap be- 
tween the symptom profiles of the different genetic groups. 

The SVM method was used as a supervised learning 
method (incorporating the knowledge of the genotype) to 
classify genotype membership on the basis of ADI-R item 
scores. SVM is currently one of the most widely used machine
learning methods in data mining, owing to its firm theoretical
foundation and strong performance in many applications.
For the SVM, a radial basis kernel function was
used, with optimal gamma and cost parameter values de- 
termined in a nested n-fold or, equivalently, leave-one-out 
cross-validation (LOOCV) procedure, n being the number 
of observations in the sample. Each observation in turn 
was left out of the sample, and an SVM classifier was opti- 
mized and built on the remaining n - 1 observations. In 
this way, an independent assessment of correctness of the 
predicted class can be achieved for each observation in the 



Table 1 (continued) ADI-R scores per domain, total scores and IQ

Genetic     ADI-R I        ADI-R II       ADI-R III     ADI-R total     IQ
disorder

22q11DS      9.8 ± 6.4      7.7 ± 4.8     2.5 ± 2.2     20.0 ± 12.0     67.0 ± 14.1
Down's       7.2 ± 4.4      6.8 ± 3.8     3.2 ± 2.0     17.2 ± 8.6      49.5 ± 11.9
PWS          7.9 ± 5.1      5.7 ± 4.5     2.8 ± 2.0     16.3 ± 10.1     70.9 ± 16.3
SMC15       15.6 ± 6.0     13.6 ± 5.5     6.5 ± 2.4     35.7 ± 12.2     51.0 ± 19.0
TSC         12.0 ± 9.0      9.6 ± 6.8     3.7 ± 3.3     25.2 ± 17.9     69.3 ± 27.4
XXY          8.5 ± 6.0      8.8 ± 5.4     2.3 ± 2.1     19.6 ± 12.0     80.4 ± 13.9
Average      9.6 ± 6.7      8.0 ± 5.6     3.1 ± 2.5     20.6 ± 13.4     68.6 ± 19.2

Average age (months): 162.7 ± 89.8

Data provided are mean values and, if applicable, standard error of the means. ADI-R domains: I, qualitative abnormalities in reciprocal social interaction; II, 
qualitative abnormalities in communication; and III, restricted, repetitive and stereotyped patterns of behavior. 22q11DS, 22q11.2 deletion syndrome; ADI-R, Autism 
Diagnostic Interview-Revised; ASD, autism spectrum disorder; Down's, Down's syndrome; PWS, Prader-Willi syndrome; SMC15, supernumerary marker chromosome 
15; TSC, tuberous sclerosis complex; XXY, Klinefelter syndrome. 
Table 2 Autism Diagnostic Interview-Revised (ADI-R) algorithm items sorted by number

Item number   Item description                                           ADI-R domain

31            Use of other's body to communicate                         I
33            Stereotyped utterances and delayed echolalia               III
34            Social verbalization/chat                                  II
35            Reciprocal conversation                                    II
36            Inappropriate questions or statements                      II
37            Pronominal reversal                                        II
38            Neologisms/idiosyncratic language                          II
39            Verbal rituals                                             III
42            Pointing to express interest                               II
43            Nodding                                                    II
44            Head shaking                                               II
45            Conventional/instrumental gestures                         II
47            Spontaneous imitation of actions                           II
48            Imaginative play                                           II
49            Imaginative play with peers                                I
50            Direct gaze                                                I
51            Social smiling                                             I
52            Showing and directing attention                            I
53            Offering to share                                          I
54            Seeking to share enjoyment with others                     I
55            Offering comfort                                           I
56            Quality of social overtures                                I
57            Range of facial expressions used to communicate            I
58            Inappropriate facial expressions                           I
59            Appropriateness of social responses                        I
61            Imitative social play                                      II
62            Interest in children                                       I
63            Response to approaches of other children                   I
64            Group play with peers (age <10.0 years)                    I
65            Friendships (age >10.0 years)                              I
67            Unusual preoccupations                                     III
68            Circumscribed interests                                    III
69            Repetitive use of objects or interest in parts of objects  III
70            Compulsions/rituals                                        III
71            Unusual sensory interests (highest score of 69/71)         III
77            Hand and finger mannerisms (highest score of 77/78)        III
78            Other complex mannerisms or stereotyped body movements     III

ADI-R, Autism Diagnostic Interview-Revised.



sample, resulting in an independent estimate of the accur- 
acy of SVM on the whole sample. In each one of the 
remaining samples, the optimization with respect to the 
gamma and cost parameter was achieved by applying a 



second LOOCV procedure, in which each of these n - 1 
observations in turn was left out of the sample and SVM 
models were fitted to the remaining n - 2 observations, 
using a grid of combinations of gamma and cost parameter 
values. In a similar fashion as described above, accuracy 
was determined for every combination of gamma and cost 
parameter values on the grid, and the optimal value of 
gamma and cost parameter was determined as the one giv- 
ing the highest accuracy. Finally, an SVM model was fitted 
to the n - 1 observations remaining in the outer loop using 
these optimal values. SVM by nature is a method for binary 
(two group) classification, so a multiclass (k classes) extension
was used, based on the 'one-against-one' approach, in
which k(k - 1)/2 binary classifiers are trained; the appropriate
'predicted' class is found by a voting scheme, choosing
the most frequently assigned class by the binary classifiers.
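
The nested leave-one-out procedure described above can be illustrated with a short sketch. It is not the analysis code of this study: to stay self-contained, a simple nearest-neighbour classifier stands in for the radial basis kernel SVM, and a grid over the number of neighbours stands in for the grid over gamma and cost. All function names and the toy data are hypothetical.

```python
from collections import Counter


def knn_predict(train_x, train_y, x, k):
    """Stand-in classifier (NOT an SVM): majority label among the k nearest rows."""
    order = sorted(
        range(len(train_x)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_x[i], x)),
    )
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(xs, ys, k):
    """Plain leave-one-out accuracy for a single hyperparameter value."""
    hits = 0
    for i in range(len(xs)):
        rest_x, rest_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        hits += knn_predict(rest_x, rest_y, xs[i], k) == ys[i]
    return hits / len(xs)


def nested_loocv(xs, ys, grid):
    """Outer loop holds out one case; an inner LOOCV on the other n - 1 picks k."""
    predictions = []
    for i in range(len(xs)):
        rest_x, rest_y = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        # Inner loop: each candidate is scored by LOOCV on n - 1 cases,
        # so every inner model is fitted to n - 2 cases.
        best_k = max(grid, key=lambda k: loocv_accuracy(rest_x, rest_y, k))
        # Refit on all n - 1 cases with the selected value, predict the held-out case.
        predictions.append(knn_predict(rest_x, rest_y, xs[i], best_k))
    return predictions


# Toy data: two well-separated groups of made-up two-item profiles.
xs = [[0, 0], [0, 1], [1, 0], [1, 1], [5, 5], [5, 6], [6, 5], [6, 6]]
ys = ["A", "A", "A", "A", "B", "B", "B", "B"]
preds = nested_loocv(xs, ys, grid=[1, 3])
print(sum(p == y for p, y in zip(preds, ys)) / len(ys))
```

Because the held-out case never takes part in the parameter search, the outer accuracy is not inflated by tuning on the same observation that is being predicted.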

Thus, the class assigned by SVM is the one with the 
maximum votes from all one-versus-one (2-group) clas- 
sifications, based on the decision values of the 2-group 
classifiers. These decision values can also, post hoc, be 
used to obtain a predicted probability for each class, 
which can be used as outcome parameters to evaluate 
the confidence of SVM predictions. 
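
The voting step can be sketched as follows. The pairwise decision rule below is a made-up stand-in for trained two-group SVM classifiers, and the names `one_vs_one_vote` and `toy_decision` are hypothetical; only the counting of votes over all k(k - 1)/2 pairs reflects the approach described in the text.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_vote(classes, binary_decision):
    """Return the class with the most wins over all pairwise comparisons.

    binary_decision(a, b) is a hypothetical callable that returns a or b.
    """
    pairs = list(combinations(classes, 2))
    votes = Counter(binary_decision(a, b) for a, b in pairs)
    winner = max(classes, key=lambda c: votes[c])  # ties go to the first listed class
    return winner, votes, len(pairs)


classes = ["22q11DS", "DS", "PWS", "SMC15", "TSC", "XXY"]


def toy_decision(a, b):
    """Toy rule: TSC wins every comparison it takes part in, else the first class wins."""
    return "TSC" if "TSC" in (a, b) else a


winner, votes, n_pairs = one_vs_one_vote(classes, toy_decision)
print(n_pairs)  # 15 pairwise classifiers for k = 6
print(winner)   # TSC
```

With six genetic disorder groups, 15 binary classifiers contribute one vote each to every prediction.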

The software used was the libSVM program, implemented
through the svm function in the e1071 library
in R [44].

Results 

Identification of behavioral signatures relating to each 
genetic disorder 

As a starting point, we explored the distribution of autism 
symptom profiles in the genetic disorder sample by PCA. 
The PCA plot showed that, on average, some genetic dis- 
order profiles were overlapping where others were more 
clearly separable (Figure 2). This picture indicated that un- 
supervised statistical analysis was not sufficiently sensitive 
to optimally distinguish genetic disorder-related profiles. 
This notion was confirmed following cluster analysis 
(k-means clustering) of the ADI-R data in the genetic dis- 
order sample, which did not identify any relevant clusters 
(data not shown). 

Figure 2 PCA plot of ADI-R profiles of subjects in the genetic
disorder sample. Colors/numbers denote genetic disorder
subgroups. 1, 22q11.2 deletion syndrome; 2, Down's syndrome; 3,
Prader-Willi syndrome; 4, supernumerary marker chromosome 15; 5,
tuberous sclerosis complex; 6, Klinefelter syndrome. ADI-R, Autism
Diagnostic Interview-Revised; PCA, principal component analysis.

To perform a more sophisticated pattern analysis, we
turned to machine learning analysis. We used SVM as a
supervised learning method to investigate genotype-phenotype
relationships between the six genetic disorders
and the item scores from the ADI-R algorithm. The
essential difference with the unsupervised PCA or clustering
analysis used above is that the SVM approach
incorporates the knowledge of the genotype in the analysis.
The SVM allocations to genetic disorder groups
occurred in two steps. First, the SVM analyzed 2-group,
'one-against-one' comparisons. Subsequently, the multiclass
extension was used to select the most appropriate
'predicted' genetic disorder class for each subject on the
basis of the most frequently assigned class by the binary
classifiers. The binary one-by-one comparisons showed 
high accuracies of up to 97% of correct genetic group al- 
locations (Table 3). As a result, a total of 63% of cases 
was correctly allocated by the multiclass comparison 
using the LOOCV method, whereas random prediction 
(without prior knowledge of genetic group) would have 
resulted in 21% accuracy (Table 4). Interestingly, in all 
groups apart from DS, the averages of the post-hoc pre- 
dicted probabilities were highest for the corresponding 
genetic disorder class, indicating that the SVM algorithm 
was able to predict correct disorder classes with a high 
degree of confidence (Table 4). 
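
The overall accuracy and the chance level quoted above can be checked with a few lines of arithmetic. The group sizes and the numbers of correctly allocated cases are taken from Table 4. The text does not state how the 21% chance level was derived; the sketch assumes random guessing in proportion to group size, which happens to give a value of about 21%.

```python
# Group sizes and correctly allocated cases, as reported in Table 4.
group_sizes = {"22q11DS": 90, "DS": 21, "PWS": 88, "SMC15": 22, "TSC": 50, "XXY": 51}
correct = {"22q11DS": 74, "DS": 1, "PWS": 68, "SMC15": 10, "TSC": 30, "XXY": 20}

n = sum(group_sizes.values())
overall = sum(correct.values()) / n

# Assumption: a random guesser picks each class with probability n_i / n,
# so its expected accuracy is the sum of squared class proportions.
chance = sum((size / n) ** 2 for size in group_sizes.values())

print(n, round(overall, 2), round(chance, 2))  # 322 0.63 0.21
```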

To further evaluate the validity of the prediction model, 
we investigated the correlation between the predicted 

Table 3 One-by-one SVM comparisons in the genetic disorder sample

SVM accuracy (proportion of correct allocations)

Genotype    22q11DS   Down's   PWS     SMC15   TSC     XXY

22q11DS     NA        0.89     0.91    0.97    0.90    0.82
Down's      0.89      NA       0.77    0.84    0.82    0.87
PWS         0.91      0.77     NA      0.84    0.86    0.86
SMC15       0.97      0.84     0.84    NA      0.94    0.88
TSC         0.90      0.82     0.86    0.94    NA      0.72
XXY         0.82      0.87     0.86    0.88    0.72    NA

22q11DS, 22q11.2 deletion syndrome; Down's, Down's syndrome; NA, not
applicable; PWS, Prader-Willi syndrome; SMC15, supernumerary marker
chromosome 15; SVM, support vector machine; TSC, tuberous sclerosis
complex; XXY, Klinefelter syndrome.



probabilities and the proportion of cases correctly assigned 
to each genetic group, based on LOOCV output. This
tests the expectation of the model that higher probabilities
reflect greater confidence in prediction, as shown by increasing
'correctness' in classification. We observed a significant
correlation (P = 0.002) between the predicted
probabilities and the likelihood of correct classification, 
which provides support for the robustness of the model 
and encouraged us to test the classifier in further samples. 

We were interested to identify which behaviors con- 
tributed most to the predictions by SVM. Therefore, the 
importance (weight) of each of the ADI-R items to the 
SVM classifier was extracted. The result of this analysis 
showed that four of the top five most influential items 
pertained to ASD symptoms that related to the quality 
of social interaction (Table 5). By contrast, the five least 
influential items were more concerned with aberrant 
communication and repetitive behaviors. 

It was notable that the predicted probabilities in 
SMC15 cases were also relatively high for prediction to 
the PWS group. This seemed plausible, as both disorders 
are associated with differences in the 'dosage' of genes 
located in chromosome 15q11-13. By contrast, SMC15
could be clearly discriminated from 22q11DS by SVM,
which corresponded with a lack of overlap in the PCA
between these two groups (Figure 2). Interestingly, SMC15
and 22q11DS are both characterized by low average
intelligence, suggesting that the behavioral differences 
are independent of general intellectual ability. To rule out 
the influence of IQ on prediction accuracy, we re-analyzed 
the data, including IQ as an additional predictor. The 
average accuracy of the SVM predictions was essentially 
unchanged (63.0% versus 62.5%), indicating that IQ was 
not a confounding factor. The poor prediction for the DS 
group was due to a frequent misallocation to the PWS 
group; 17 of the DS cases were being incorrectly assigned 
to the PWS group. Indeed, an overlap between DS and 
PWS groups was also apparent in the PCA of the symp- 
tom profiles (Figure 2). 

We also tested the accuracy of SVM class assignment 
among the subset of individuals who scored above the 
ADI-R threshold for ASD (n = 123). This resulted in 
similar assignment accuracies and predicted probabilities 
(data not shown). In subsequent analyses we used the al- 
gorithm derived from all patients from our genetic dis- 
order samples, irrespective of whether they met formal 
criteria for ASD diagnosis, since from a clinical perspec- 
tive, we also wanted to include the profiles of subjects 
who scored below ADI-R thresholds for ASD. 

Testing the SVM classification algorithm in idiopathic ASD 

Next, we considered whether the genetic disorder algo- 
rithm could detect a degree of similarity in patterns of 
autistic behavior in a sample of 'idiopathic' cases. To test 
Table 4 Leave-one-out cross-validation (LOOCV) results for the SVM model on ADI-R items for the genetic
disorder sample

SVM frequency of assigned class (n) and predicted probabilities (Prob.), by genetic disorder group (columns)

Assigned    22q11DS        Down's         PWS            SMC15          TSC            XXY            Total
class       n    Prob.     n    Prob.     n    Prob.     n    Prob.     n    Prob.     n    Prob.     n

22q11DS     74   0.602      1   0.092      7   0.095      0   0.032      6   0.15      14   0.215     102
Down's       0   0.027      1   0.166      1   0.103      0   0.097      0   0.03       0   0.047       2
PWS          7   0.099     18   0.485     68   0.537      9   0.301      7   0.12       6   0.167     115
SMC15        0   0.02       0   0.11       4   0.089     10   0.326      0   0.05       2   0.067      16
TSC          2   0.103      0   0.044      3   0.068      0   0.064     30   0.43       9   0.185      44
XXY          7   0.149      1   0.103      5   0.108      3   0.179      7   0.22      20   0.319      43
Total       90             21             88             22             50             51             322
Accuracy    74/90 (82%)    1/21 (10%)     68/88 (77%)    10/22 (45%)    30/50 (60%)    20/51 (39%)    203/322 (63%)




22q11DS, 22q11.2 deletion syndrome; ADI-R, Autism Diagnostic Interview-Revised; Down's, Down's syndrome; LOOCV, leave-one-out cross-validation; PWS, 
Prader-Willi syndrome; SMC15, supernumerary marker chromosome 15; SVM, support vector machine; TSC, tuberous sclerosis complex; XXY, Klinefelter syndrome. 



this hypothesis, we applied the algorithm to ADI-R data 
obtained from the AGRE dataset. It should be
noted that the AGRE sample functioned as a 'blind' sample
in this context, as we could not validate the outcome
with genetic labels. Therefore, we performed analyses to
indicate if the algorithm would detect meaningful associations
or if these would not differ from random associations,
that is, associations not informed by genetic disorder
labels. Thus, we generated randomly permuted ADI-R
item data from the AGRE0 dataset and compared the
distribution of predicted probabilities in the real data (AGRE0
and genetic disorder sample) with that in the randomly
generated data. The probabilities differed significantly
between these groups. As expected, the highest predicted
probabilities were observed among the genetic
disorder cases, and the lowest probabilities were observed
in the randomly generated AGRE subsample.
There was a significant difference between the genetic
groups and AGRE0 (P = 0.0024), between the genetic
groups and random data (P <0.001) and between
AGRE0 and random data (Figure 3). Most importantly,
the probabilities in AGRE0 were significantly higher
than those in the randomly configured data (P <0.001).
This indicated that the algorithm derived from the genetic
disorders detected non-random pattern information.

Subsequently, we applied the genetic disorder classifier
to the AGRE0 sample to analyze the distribution of genetic
disorder allocations in the blind AGRE subsamples.
The genetic disorder algorithm assigned the highest
probabilities and most cases to the TSC group and the
lowest probabilities and fewest cases to the DS and PWS
groups. We observed a similar distribution of SVM predicted
probabilities in the AGRE1 and AGRE2 samples,
essentially replicating the result obtained for AGRE0.
Again, TSC was by far the most commonly assigned
class, whereas DS and PWS were the least frequently
assigned classes. The predicted probabilities and group
predictions for AGRE0, AGRE1 and AGRE2 are summarized
in Table 6. It should be noted that these predictions
were achieved by forcing all individuals into one of
the six categories, which means that frequent allocation
should be interpreted as indicative of relative phenotype


Table 5 ADI-R items that contributed most and least to the result of the SVM analysis on the genetic syndrome sample

Lowest five ADI-R items                            Top five ADI-R items
Item number   Item description                     Item number   Item description

70            Compulsions/rituals                  63            Response to approaches of other children
38            Neologisms/idiosyncratic language    49            Imaginative play with peers
58            Inappropriate facial expressions     64, 65        Group play with peers/friendships
39            Verbal rituals                       56            Quality of social overtures
37            Pronominal reversal                  68            Circumscribed interests

ADI-R, Autism Diagnostic Interview-Revised; SVM, support vector machine.
similarity. As such, the application of the genetic disorder
classifier to AGRE samples seemed to indicate enhanced
relative similarity of AGRE profiles to the TSC
group. To support this notion, we plotted the AGRE0
ADI-R profiles in the PCA plot of the genetic disorder
sample, which confirmed that, on average, the TSC group
displayed most similarity to AGRE0 (Figure 4). In addition,
the 22q11DS, SMC15 and XXY groups also displayed some
closeness to AGRE0, which also seems to be reflected in their
occasional allocation by the genetic disorder classifier.

We contrasted these predictions in the AGRE sample 
with random predictions; we generated SVM models by 
randomly permuting the six labels relating to the genetic 
disorders. Thus, random genetic labels were linked to 
the existing symptom profiles, thereby destroying the 
original relationship between ADI-R score profiles and 
the genetic groups. By analyzing the allocations arising 
from these random classifier algorithms, we could check 
which distribution of allocation would arise by chance, 
that is not informed by existing genetic disorder profiles. 
We repeated this exercise 1,000 times in order to gain robust
results. The results showed that most cases were assigned
to the 22q11DS and PWS groups. This result was most
likely due to the fact that these disorders were the two largest
groups in the genetic disorder sample. It should be
noted that this result was strikingly different from the allocation
in AGRE by the classifier trained on the true genetic labels.
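
The label permutation check can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: a nearest-centroid rule stands in for the SVM so that the example needs no external library, and the function names and toy profiles are hypothetical.

```python
import random
from collections import Counter


def centroids(xs, ys):
    """Mean item-score profile per label (stand-in for training an SVM)."""
    sums, counts = {}, Counter(ys)
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for j, value in enumerate(x):
            acc[j] += value
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def allocate(model, x):
    """Assign a profile to the label with the closest centroid."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(model[y], x)))


def permuted_allocations(train_x, train_y, blind_x, n_perm, seed=0):
    """Tally how blind cases are allocated when the training labels are shuffled."""
    rng = random.Random(seed)
    tally = Counter()
    for _ in range(n_perm):
        shuffled = list(train_y)
        rng.shuffle(shuffled)  # destroys the link between profile and label
        model = centroids(train_x, shuffled)
        tally.update(allocate(model, x) for x in blind_x)
    return tally


# Made-up two-item profiles for three labelled groups and three blind cases.
train_x = [[0, 0], [0, 1], [2, 2], [2, 1], [1, 0], [1, 2]]
train_y = ["A", "A", "B", "B", "C", "C"]
blind_x = [[0, 0], [2, 2], [1, 1]]
tally = permuted_allocations(train_x, train_y, blind_x, n_perm=1000)
print(sum(tally.values()))  # 3000 allocations: 1,000 permutations x 3 blind cases
```

The tally over permutations gives the allocation distribution expected when the classifier carries no genotype information, against which the allocations from the classifier trained on the true labels can be compared.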

Together, these analyses on blind AGRE samples indi- 
cated that the algorithm of the genetic disorder sample 
could detect an extent of relative similarity in ADI-R 
profile patterns among idiopathic subjects. 
Figure 3 SVM predicted probabilities of the original genetic
groups, AGRE0 singleton dataset and randomly generated
scores for the AGRE0 singleton dataset. Mean SVM probabilities
differed significantly between the genetic groups and AGRE0
(P = 0.0024), between the genetic groups and random data
(P <0.001) and between AGRE0 and random data (P <0.001). SVM,
support vector machine.

Behavioral signatures in sibling pairs with idiopathic ASD

To test our expectation that the signature patterns derived 
from the genetic disorders relate to genotype-phenotype 
associations, we hypothesized that affected sibling (sib) pairs
would be significantly more often assigned to the same
genetic disorder class and be relatively more similar in
their behavioral profile than non-related subjects. To test
this, we examined the concurrence in class assignment (chi-square
test) and the correlation between affected sib pairs in the
SVM assigned class and predicted probabilities.

Significant dependence between the class assignment of 
siblings in AGRE1 and that of the other sibling in AGRE2 was 
indicated (chi-squared = 43, df = 25, P = 0.015). Furthermore, 
the predicted probabilities for the assigned class in AGRE1 
(sib1) were significantly correlated with the predicted 
probabilities of their affected sibling in AGRE2 (sib2) 
(Pearson's correlation r = 0.20, P <0.001) (Figure 5). To 
exclude the possibility that these correlations were driven 
by severity rather than specificity of ADI-R profiles, we 
tested both as predictors: the severity of the proband's 
symptom scores did not predict the predicted probability of 
the sibling, whereas the proband's predicted probability 
score did predict the probability score of the sibling 
(sibling 1 as predictor of sibling 2: mean items score 
P = 0.18; probability score P = 1.5e-05; sibling 2 as 
predictor of sibling 1: mean items score P = 0.86; 
probability score P = 7e-05). 
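
As an illustration of the two statistics reported above, the following minimal sketch computes a chi-square statistic of independence for paired class assignments and a Pearson correlation for paired probabilities. The function names and the toy pairs are hypothetical and are not the study data; with six classes that all occur in both siblings, the degrees of freedom are (6 - 1) x (6 - 1) = 25, which matches the df reported in the text.

```python
from collections import Counter
from math import sqrt

def chi_square_independence(pairs, classes):
    """Pearson chi-square statistic and df for paired class assignments.

    Assumes every class in `classes` occurs at least once in each sibling;
    otherwise the df should be reduced accordingly.
    """
    n = len(pairs)
    observed = Counter(pairs)
    row = Counter(a for a, _ in pairs)   # sib1 marginal counts
    col = Counter(b for _, b in pairs)   # sib2 marginal counts
    stat = 0.0
    for a in classes:
        for b in classes:
            expected = row[a] * col[b] / n
            if expected > 0:
                stat += (observed[(a, b)] - expected) ** 2 / expected
    df = (len(classes) - 1) ** 2
    return stat, df

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

classes = ["22q11DS", "Down's", "PWS", "SMC15", "TSC", "XXY"]
toy_pairs = [(c, c) for c in classes] * 5   # fully concordant toy pairs
stat, df = chi_square_independence(toy_pairs, classes)
print(stat, df)
```

Converting the statistic into a P value requires the chi-square distribution, which is outside the Python standard library and is therefore left out of this sketch.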

Interestingly, the correlation in prediction probabilities 
was driven by a positive correlation (r = 0.35) between sib 
pairs assigned to the same class, in contrast to 'discordant' 
sibs (r = -0.18), that is, sibling pairs that had not been 
assigned to the same class. In addition, we found that the 
covariance in probabilities between sibs was greater when both 
sibs were assigned to the same genetic disorder class (F-test 
for equality of variances of the difference in probability, 
P <0.001). To confirm the notion of enhanced behavioral 
similarity between siblings allocated to the same genetic 
disorder class, we examined the ADI-R scores directly. We used 
the first principal component (PC1) of the ADI-R scores as a 
summary measure. Overall (disregarding genetic disorder class), 
the PC1 scores of sibs were not significantly correlated 
(r = 0.081, P = 0.089), but when split by concordance of 
genetic disorder prediction, the correlations were 0.71 and 
-0.16 for concordant and discordant sibs, respectively 
(P <0.001 for 'concordant' versus 'discordant' sibs). Overall, 
the sibling analysis indicated that the familial liability to 
ASD may be partitioned according to the relative likelihood of 
disturbance related to certain genetic disorders. 
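
The text does not state which test was used to compare the correlations of concordant and discordant sib pairs. One common choice is Fisher's r-to-z comparison of two independent correlations, sketched below as an assumption rather than as the authors' method. The function name is hypothetical, and the subgroup sizes in the usage line are placeholders because they are not reported in this section.

```python
from math import atanh, erfc, sqrt

def fisher_z_compare(r1, n1, r2, n2):
    """Two-sided comparison of two independent Pearson correlations.

    Each r is transformed with atanh (Fisher's z); the difference is
    scaled by its standard error and referred to a standard normal.
    """
    se = sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (atanh(r1) - atanh(r2)) / se
    p = erfc(abs(z) / sqrt(2.0))   # two-sided normal tail probability
    return z, p

# Correlations taken from the text; both sample sizes are HYPOTHETICAL.
n_concordant, n_discordant = 100, 100
z, p = fisher_z_compare(0.71, n_concordant, -0.16, n_discordant)
print(z, p)
```

Because the standard error depends only on the two subgroup sizes, the outcome of such a comparison is sensitive to how many pairs fall into each subgroup.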

Discussion 

This study demonstrates that patterns of autistic symptom- 
atology can be associated with specific genetic disorders. 
There has been much speculation that such genotype- 
phenotype correlations exist but so far only limited evidence 

Table 6 Application of the SVM algorithm derived from the genetic disorder sample to the different AGRE datasets 

Genetic   AGRE0                                              AGRE1                                              AGRE2
disorder  n    % assigned  Mean probability  SD probability  n    % assigned  Mean probability  SD probability  n    % assigned  Mean probability  SD probability
22q11DS   26   6.9         0.44              0.154           23   5.2         0.44              0.143           28   6.3         0.48              0.189
Down's    1    0.3         0.25              NA              1    0.2         0.28              NA              1    0.2         0.27              NA
PWS       1    0.3         0.34              NA              5    1.1         0.30              0.131           5    1.1         0.33              0.072
SMC15     24   6.4         0.40              0.086           28   6.3         0.40              0.102           32   7.2         0.41              0.093
TSC       255  68          0.61              0.139           302  68.2        0.62              0.134           283  63.9        0.60              0.140
XXY       68   18.1        0.41              0.092           84   19          0.41              0.071           94   21.2        0.42              0.095
Total     375  100                                           443  100                                           443  100

% assigned, percentage assigned to the respective genetic disorder class; 22q11DS, 22q11.2 deletion syndrome; AGRE, Autism Genetics Resource Exchange; 
Down's, Down's syndrome; mean probability, average predicted probability; n, number of assigned cases to the respective genetic disorder class; NA, not 
applicable; PWS, Prader-Willi syndrome; SD probability, standard deviation of predicted probability; SMC15, supernumerary marker chromosome 15; SVM, support 
vector machine; TSC, tuberous sclerosis complex; XXY, Klinefelter syndrome. 



to support the conjecture. Our results are consistent with 
findings from animal research and suggest that different 
pathophysiological pathways underlie certain behavioral 
deficits [4,45]. 

The current study is the first to test the specificity of 
genetic behavioral phenotypes using a machine learning 
paradigm. The ADI-R algorithm items comprised a comparatively 
small number of symptom features, yet we used this small set 
of items to classify our cases. The total proportion of 
correct allocations (63%) was substantial given that six 
groups were compared. Indeed, this result was derived from 
one-by-one genetic disorder comparisons, in which strong 
contrasts were evident. It was notable, however, that the SVM 
algorithm derived from the current sample differentiated 
between some classes better than others. This variability 
might be explained by the variation in sample sizes; thus, 
larger samples will need to be investigated in future. It was 
also notable that the ratings of the pattern of social 
dysfunction were among the best contributors to class 
prediction, 




Figure 4 PCA plot of ADI-R profiles of subjects in the genetic disorder sample, with the AGRE0 subsample inserted. PC2 is the 
dimension with the most differentiating contrast among the genetic disorder groups. AGRE0, on average, has negative values on PC1 and is 
around 0 on PC2. The TSC group (5) is also on average 0 on PC2, similar to AGRE0, and has the most negative average on PC1. Groups 1, 4 and 6 
also display some closeness to AGRE0. Colors/numbers/letters denote genetic disorder subgroups. 1, 22q11.2 deletion syndrome; 2, Down's 
syndrome; 3, Prader-Willi syndrome; 4, supernumerary marker chromosome 15; 5, tuberous sclerosis complex; 6, Klinefelter syndrome; A, AGRE0. 
ADI-R, Autism Diagnostic Interview-Revised; PCA, principal component analysis; TSC, tuberous sclerosis complex. 

Figure 5 Correlation of SVM predicted probabilities between 
AGRE siblings. AGRE, Autism Genetics Resource Exchange; SVM, 
support vector machine. 



raising the possibility that particular styles of social im- 
pairment may be related to particular genetic risk factors. 
Although differences in the typology of social impairments 
have been noted in ASD [46], differences in the types of 
social impairment have not been studied in detail and are 
only partially captured by the ADI-R items. For instance, 
social avoidance is commonly reported in fragile X syndrome, 
which is another example of social behavioral specificity 
within a genetic disorder associated with ASD [47,48]. It 
seems likely that with the incorporation of more symp- 
toms and other phenotypic features, such as the presence 
of comorbid behavioral problems like those associated 
with ADHD [49], the ability to assign cases to specific 
classes of genetic disorder may be improved. The inclu- 
sion of other conditions such as fragile X syndrome may 
also help further map the patterns of genotype-phenotype 
correlations. Together, these extensions may reveal further 
contrasts or overlaps between genetic disorders that are 
biologically meaningful. For instance, it was already 
interesting that the prediction probabilities for SMC15 were 
similar to those for PWS. Both disorders are associated 
with abnormalities in the dosage of genes located in the 
15q11-13 region and likely lead to perturbations in similar 
pathophysiological pathways. 

The subjects of this study were included because they 
were ascertained for the presence of a genetic disorder 
and were assessed regardless of the presence or absence 
of behavioral concerns. Although this approach is likely 
to have minimized ascertainment biases, some bias can- 
not be ruled out. However, any enrichment of behavioral 
abnormalities in these cohorts is unlikely to give rise to 
the specific patterns of associations identified here. It 
was reassuring in this respect that the algorithm derived 
from all cases in the genetic disorder samples gave 
comparable results to the analyses that included only the 
subjects who scored above the ADI-R threshold for 
ASD. Analysis confirmed that IQ did not seem to act as 
a confounding factor in the SVM predictions. Also, the 
influence of age and medication as confounds could be 
ruled out, because the ADI-R algorithm codes behaviors 
between 4 and 5 years of age [35]. 

The application of the genetic disorder algorithm to 
AGRE samples indicated that the behavioral patterns ob- 
served in cases of idiopathic autism were not random. 
Therefore, these results could be used to estimate rela- 
tive similarity to behavioral profiles designated from 
the genetic disorders. In addition, the sibling analysis 
showed correlation of SVM predictions between affected 
sib pairs. These findings indicate the feasibility of 
partitioning familiality into components according to patterns 
of autistic symptomatology, for example concordance in 
relative similarity to behavioral profiles related to the 
genetic disorders. This notion should be followed up by 
studies that incorporate genetic or pathway information 
to validate the behavior-based stratification in idiopathic 
samples. For instance, our allocation of idiopathic 
ASD cases to TSC-derived patterns may be supported by 
molecular data showing mammalian target of rapamycin 
(mTOR) pathway deregulation. Such a result would sup- 
port the view that perturbation of the mTOR signaling 
cascade is a common pathophysiological feature of hu- 
man neurological disorders, including mental retardation 
syndromes and ASDs [49]. If confirmed, such results 
could complement future gene searches, since stratifica- 
tion on the basis of behavioral profile may significantly 
increase the power to detect which (combination of) 
genetic disorder related pathways are most prominently 
involved. Indeed, the notion that pathophysiological pro- 
cesses are shared in syndromic and idiopathic cases of 
ASD is supported by a recent study that showed converging 
synaptic pathophysiology between syndromic (for 
example, arising from a defined genetic disorder) and 
non-syndromic rodent models of autism [50]. Moreover, 
genotype stratification may also have important treat- 
ment implications, as other animal studies suggest that 
the best treatment approaches for some genetic disor- 
ders (for example fragile X syndrome) may be unsuitable 
for others (for example tuberous sclerosis) [49]. 

Conclusion 

Our proof of concept study indicates the existence of 
'signature' autistic behavioral profiles that index 
underlying genetic risk processes. These signatures may be 
helpful in disentangling the etiological and phenotypic 
heterogeneity evident in ASD, but warrant replication in 
larger and independent samples. The approach presented 
in this study could hold promise as a means of 
stratifying patients who may benefit from treatments 
targeted at specific pathways and as a way of identifying 
those patients in whom interventions may have unwanted 
effects. 